Language Identification With Confidence Limits
Abstract
A statistical classification algorithm and its application to language identification from noisy input are described. The main innovation is to compute confidence limits on the classification, so that the algorithm terminates once enough evidence has accumulated to make a clear decision, thereby avoiding problems with categories that have similar characteristics. A second application, to genre identification, is briefly examined. The results show that some of the problems of other language identification techniques can be avoided, and illustrate a more important point: that a statistical language process can be used to provide feedback about its own success rate.

1 Introduction

Language identification is an example of a general class of problems in which we want to assign an input data stream to one of several categories as quickly and accurately as possible. It can be solved using many techniques, including knowledge-poor statistical approaches. Typically, the distribution of n-grams of characters or other objects is used to form a model. A comparison of the input against the model determines the language which matches best. Versions of this simple technique can be found in Dunning (1994) and Cavnar and Trenkle (1994), while an interesting practical implementation is described by Adams and Resnik (1997).

A variant of the problem is considered by Sibun and Spitz (1994), and Sibun and Reynar (1996), who look at it from the point of view of Optical Character Recognition (OCR). Here, the language model for the OCR system cannot be selected until the language has been identified. They therefore work with so-called shape tokens, which give a very approximate encoding of the characters' shapes on the printed page without needing full-scale OCR. For example, all upper case letters are treated as being one character shape, all characters with a descender are another, and so on.
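The character shape encoding described above can be sketched in a few lines. This is only an illustration: the class codes ("A", "g", "x"), the descender set, and the function names are assumptions made for the example, not the inventory used by Sibun and Spitz.

```python
# Hypothetical shape classes; the real inventory is not given in this text.
DESCENDERS = set("gjpqy")  # assumed set of lower-case letters with a descender


def char_shape(ch: str) -> str:
    """Map one character to a coarse shape code (illustrative classes only)."""
    if ch.isupper():
        return "A"  # all upper-case letters share one shape
    if ch in DESCENDERS:
        return "g"  # characters with a descender share another
    if ch.isalpha():
        return "x"  # assumption: every other letter falls into one class
    return ch       # assumption: non-letters are passed through unchanged


def word_shape(word: str) -> str:
    """Encode a whole word as a string of character shape codes."""
    return "".join(char_shape(ch) for ch in word)
```

For instance, `word_shape("Language")` gives `"Axxgxxgx"`: many distinct words collapse onto the same code string, which is why the encoding is only a very approximate one.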
Sequences of character shape codes separated by white space are assembled into word shape tokens. Sibun and Spitz then determine the language on the basis of linear discriminant analysis (LDA) over word shape tokens, while Sibun and Reynar explore the use of entropy relative to training data for character shape unigrams, bigrams and trigrams. Both techniques are capable of over 90% accuracy for most languages. However, the LDA-based technique tends to perform significantly worse for languages which are similar to one another, such as the Norse languages. Relative entropy performs better, but still has some noticeable error clusters, such as confusion between Croatian, Serbian and Slovenian.

What these techniques lack is a measure of when enough information has been accumulated to distinguish one language from another reliably: they examine all of the input data and then make the decision. Here we will look at a different approach which attempts to overcome this by maintaining a measure of the total evidence accumulated for each language and how much confidence there is in the measure. To outline the approach:

1. The input is processed one (word shape) token at a time. For each language, we determine the probability that the token is in that language, expressed as a 95% confidence range.
2. The values for each word are accumulated into an overall score with a confidence range for the input to date, and compared both to an absolute threshold, and with
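The text breaks off partway through step 2, so the following is only a rough sketch of the general idea of accumulating evidence with a confidence range and stopping early. Several parts are assumptions made for illustration and are not taken from the paper: the Wald (normal approximation) interval, the summing of log bounds, the `models` layout, and the stopping rule (the leader's lower bound must exceed every other language's upper bound). The absolute threshold mentioned in step 2 is left out.

```python
import math
from collections import Counter

Z95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def token_interval(count, total, floor=1e-6):
    """Assumed Wald 95% interval (low, mid, high) for a token's frequency."""
    p = count / total
    half = Z95 * math.sqrt(p * (1 - p) / total)
    # floor keeps every bound positive so that its logarithm is defined
    return max(p - half, floor), max(p, floor), min(max(p + half, floor), 1.0)


def identify(tokens, models):
    """models: language -> Counter of training token counts (hypothetical)."""
    totals = {lang: sum(counts.values()) for lang, counts in models.items()}
    # accumulated log scores per language: [low, mid, high]
    score = {lang: [0.0, 0.0, 0.0] for lang in models}
    best = None
    for seen, tok in enumerate(tokens, start=1):
        for lang, counts in models.items():
            bounds = token_interval(counts[tok], totals[lang])
            for i, bound in enumerate(bounds):
                score[lang][i] += math.log(bound)
        best = max(score, key=lambda lang: score[lang][1])
        rivals_high = [score[lang][2] for lang in score if lang != best]
        # assumed stopping rule: the leader's range clears all rival ranges
        if rivals_high and score[best][0] > max(rivals_high):
            return best, seen
    return best, len(tokens)
```

The function returns the chosen language together with the number of tokens consumed, so a caller can see whether a decision was reached early or only after the input ran out. Note that a token unseen in training gets a zero-width interval at the floor value here, which overstates confidence; a real system would need smoothing.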
Similar resources
Reflective Teaching in ELT: Obstacles and Coping Strategies
The present study aimed to document the constraints and limits in applying reflective teaching principles in ELT settings in Iran from the teachers’ perspective along with solutions and coping strategies to help remove the obstacles. 60 teachers teaching general English at 6 language institutes were selected through convenience sampling. First, the teacher participants filled out a reflectivity...
The Relationship between EFL Learners’ Self-Identity Changes, Motivation Types, and EFL Proficiency
This study aimed to explore the relationships between foreign language learners’ self-identity changes, motivation types, and Foreign Language proficiency associated with learning English in private language schools in Iranian context. Based on a stratified sampling, 204 English as a foreign language learners from three language schools in Tehran were selected to participate in the study. The i...
Evaluation of confidence measures for language identification
In this paper we examine various ways to derive confidence measures for a language identification system [3], using phone recognition followed by language models, and describe the application of an evaluation metric [1] for measuring the “goodness” of the different confidence measures. Experiments are conducted on the 1996 NIST Language Identification Evaluation corpus (derived from the Callfri...
Comparison of spectral methods for spoken language identification
Identifying spoken language automatically is to identify a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
Publication date: 2002